Part 1

Author

Albert Rapp

Published

June 19, 2024

Part I

Lesson 1

Learning about the mechanics of wrangling data, in particular using the mutate() and summarise() functions efficiently. Two datasets are used throughout: penguins and ames.

The dimensions of the penguins dataset: 344 observations and 8 columns, or variables

Code
dim(penguins)
[1] 344   8

To get a good overview of a dataset, we can use the skim() function from the skimr package. It provides summary statistics and reports the proportion of missing cases.

Code
palmerpenguins::penguins |> 
  skimr::skim()
Data summary
Name palmerpenguins::penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

An alternative to skimr for descriptive statistics is the gt_plt_summary() function from the gtExtras package. For it to work, you need to select the variables in advance.

Code
library(tidyverse)
library(gtExtras)
library(gt)

penguins |>
  select(bill_length_mm,bill_depth_mm) |>
  gt_plt_summary()
select(penguins, bill_length_mm, bill_depth_mm)
344 rows x 2 cols
Column Missing Mean Median SD
bill_length_mm 0.6% 43.9 44.5 5.5
bill_depth_mm 0.6% 17.2 17.3 2.0
(The rendered table also includes an inline overview plot per column, not reproducible here.)

If you want to focus on specific variables, you can pass them to the skim() function. For instance, examining the properties of Lot_Frontage and Lot_Area:

Code
modeldata::ames |> 
  skimr::skim(Lot_Frontage,Lot_Area)
Data summary
Name modeldata::ames
Number of rows 2930
Number of columns 74
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Lot_Frontage 0 1 57.65 33.50 0 43.00 63.0 78.00 313 ▇▇▁▁▁
Lot_Area 0 1 10147.92 7880.02 1300 7440.25 9436.5 11555.25 215245 ▇▁▁▁▁

Notice the ames data does not have nice variable names: they are capitalized. It is a good idea to use janitor::clean_names() to convert the names to lowercase snake_case.

Code
modeldata::ames |> 
  janitor::clean_names() |> 
  skimr::skim(lot_frontage,lot_area)
Data summary
Name janitor::clean_names(mode…
Number of rows 2930
Number of columns 74
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lot_frontage 0 1 57.65 33.50 0 43.00 63.0 78.00 313 ▇▇▁▁▁
lot_area 0 1 10147.92 7880.02 1300 7440.25 9436.5 11555.25 215245 ▇▁▁▁▁

Lesson 2

The focus of this lesson is the mutate() function. It also deals with vectorized vs. iterative calculations associated with the mutate() function.

It also introduces the map() function. This function applies another function to each element of a vector, forcing a one-by-one calculation instead of operating on the vector as a whole. The variations of map() discussed include map_lgl() (map logical) and map_dbl() (map double, i.e. numeric). The first produces a logical vector. The second produces a numeric vector; when the applied function returns TRUE or FALSE, those values are coerced to 1 and 0.
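
The one-by-one idea can be sketched in base R (a minimal illustration, not part of the lesson code; vapply() stands in for purrr's map_lgl(), and the threshold 39 matches the example used later):

```r
# A function that only handles a single value at a time
is_large <- function(x) {
  if (x > 39) TRUE else FALSE
}

# vapply() applies is_large() to each element one by one,
# much like purrr::map_lgl() does
vapply(c(34, 39.5, 60), is_large, logical(1))
#> [1] FALSE  TRUE  TRUE
```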

Handling missing cases

It is important to handle missing values before calling mutate(). We can verify that missing cases were removed using skimr. Remember that the skim() output above showed that sex had 11 missing cases, i.e. a complete rate of 97%.

Code
library(palmerpenguins)
library(tidyverse)
library(modeldata)
library(janitor)

#Removing missing cases using !is.na(sex)

penguins_wo_NAs <-  palmerpenguins::penguins |> 
  filter(!is.na(sex))

skimr::skim(penguins_wo_NAs)
Data summary
Name penguins_wo_NAs
Number of rows 333
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1 FALSE 3 Ade: 146, Gen: 119, Chi: 68
island 0 1 FALSE 3 Bis: 163, Dre: 123, Tor: 47
sex 0 1 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 0 1 43.99 5.47 32.1 39.5 44.5 48.6 59.6 ▃▇▇▆▁
bill_depth_mm 0 1 17.16 1.97 13.1 15.6 17.3 18.7 21.5 ▅▆▇▇▂
flipper_length_mm 0 1 200.97 14.02 172.0 190.0 197.0 213.0 231.0 ▂▇▃▅▃
body_mass_g 0 1 4207.06 805.22 2700.0 3550.0 4050.0 4775.0 6300.0 ▃▇▅▃▂
year 0 1 2008.04 0.81 2007.0 2007.0 2008.0 2009.0 2009.0 ▇▁▇▁▇

Handling vectors

mutate() works with vectorized functions. This means the two columns listed inside mutate() are taken as whole vectors: the values in each column are not processed one by one. Instead, the vectors are handled element-wise in a single vectorized division (/).

Imagine these two vectors being “smashed together”, with the division happening across them in one step.

Code
penguins_wo_NAs |> 
  mutate(
    bill_flipper_ratio = bill_length_mm / flipper_length_mm
  ) |> 
  select(bill_length_mm, bill_depth_mm, bill_flipper_ratio)
# A tibble: 333 × 3
   bill_length_mm bill_depth_mm bill_flipper_ratio
            <dbl>         <dbl>              <dbl>
 1           39.1          18.7              0.216
 2           39.5          17.4              0.212
 3           40.3          18                0.207
 4           36.7          19.3              0.190
 5           39.3          20.6              0.207
 6           38.9          17.8              0.215
 7           39.2          19.6              0.201
 8           41.1          17.6              0.226
 9           38.6          21.2              0.202
10           34.6          21.1              0.175
# ℹ 323 more rows

Here is an example where the vectorized assumption is not met, using a custom function to illustrate the problem. Let's apply the function to one of the vectors used in computing bill_flipper_ratio, in this case bill_length_mm. Our hope is to get a vector of TRUE and FALSE values.

Notice that applying the function prompts an error message. It traces back to if (x > 39): the condition has length > 1, but if() expects a single value.

Code
large_quantity <- function(x) {
  if (x > 39) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

# penguins_wo_NAs |>
#   mutate(
#     large_bill_length = large_quantity(bill_length_mm)
#     ) |>
#   select(bill_length_mm,large_bill_length)


#Reconstructing the essence of the function as a vectorized comparison
c(34, 234) > 39
[1] FALSE  TRUE
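
As an aside (a base-R sketch with invented values): for a simple threshold like this, the comparison itself is already vectorized and returns the logical column directly, so map_lgl() is only needed when the function genuinely cannot be vectorized:

```r
bill_lengths <- c(39.1, 39.5, 36.7)  # illustrative values

# The comparison operates on the whole vector at once
bill_lengths > 39
#> [1]  TRUE  TRUE FALSE

# Equivalent inside mutate(): large_bill_length = bill_length_mm > 39
```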

Functional programming or map()

map_lgl() enforces the one-by-one calculation instead of the whole-vector calculation.

To address this limitation of working with vectors, we need functional programming. map() performs the calculation one element at a time. Inside mutate(), it is the large_quantity() function that gets called on each element.

In other words, instead of the vector-smashing approach to handling the observations in a column, each value listed in the mutate() call is handled individually.

We rely on the map() function, in this case the logical variant map_lgl(). We instruct R to apply the large_quantity() function to each component of the column (vector) bill_length_mm, one by one. Notice that map_lgl() lists the vector or column first and the function to apply second. In other words, map_lgl() applies the function to each element of the vector instead of smashing the whole vector together.

“The map functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input.”

Code
large_quantity <-  function(x) {
  if (x > 39) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

penguins_wo_NAs |>
  mutate(
    large_bill_length = map_lgl(
      bill_length_mm,
    #Applying large_quantity function to vector bill_length_mm
      large_quantity) ) |> 
  select(bill_length_mm,large_bill_length)
# A tibble: 333 × 2
   bill_length_mm large_bill_length
            <dbl> <lgl>            
 1           39.1 TRUE             
 2           39.5 TRUE             
 3           40.3 TRUE             
 4           36.7 FALSE            
 5           39.3 TRUE             
 6           38.9 FALSE            
 7           39.2 TRUE             
 8           41.1 TRUE             
 9           38.6 FALSE            
10           34.6 FALSE            
# ℹ 323 more rows

Another variation is map() itself. Instead of producing an atomic vector, map() produces a list, one element per input value. Inside mutate(), this shows up as a list-column.

Code
large_quantity <-  function(x) {
  if (x > 39) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

penguins_wo_NAs |>
  mutate(
    large_bill_length = map(
      bill_length_mm,
  #Applying large_quantity function to vector bill_length_mm
      large_quantity) ) |> 
  select(bill_length_mm,large_bill_length)
# A tibble: 333 × 2
   bill_length_mm large_bill_length
            <dbl> <list>           
 1           39.1 <lgl [1]>        
 2           39.5 <lgl [1]>        
 3           40.3 <lgl [1]>        
 4           36.7 <lgl [1]>        
 5           39.3 <lgl [1]>        
 6           38.9 <lgl [1]>        
 7           39.2 <lgl [1]>        
 8           41.1 <lgl [1]>        
 9           38.6 <lgl [1]>        
10           34.6 <lgl [1]>        
# ℹ 323 more rows

Still another variation of the map() function is map_dbl(). It reports numbers (1 = TRUE, 0 = FALSE) instead of logical values, because the TRUE/FALSE returned by the function is coerced to double. Notice that we again list the input column first, followed by the function (large_quantity).

Code
large_quantity <-  function(x) {
  if (x > 39) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

penguins_wo_NAs |>
  mutate(
    large_bill_length = map_dbl(
      bill_length_mm,
      large_quantity) ) |> 
  select(bill_length_mm,large_bill_length)
# A tibble: 333 × 2
   bill_length_mm large_bill_length
            <dbl>             <dbl>
 1           39.1                 1
 2           39.5                 1
 3           40.3                 1
 4           36.7                 0
 5           39.3                 1
 6           38.9                 0
 7           39.2                 1
 8           41.1                 1
 9           38.6                 0
10           34.6                 0
# ℹ 323 more rows

Lesson 3

This section covers the summarise() function. This function complements mutate(): both work with vectors, but the difference is the size of the output. While mutate() produces a vector as long as the data, summarise() returns single elements, or atomic values.

Code
penguins_wo_NAs |> 
  summarise(
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_flipper_length = sd(flipper_length_mm, na.rm = TRUE) )
# A tibble: 1 × 2
  mean_flipper_length sd_flipper_length
                <dbl>             <dbl>
1                201.              14.0

You can also save the lengths into a single list using the list() function. In other words, the last column, labeled flipper_lengths, contains a list holding a vector of values. This vector has 333 values corresponding to the 333 penguins' flipper lengths.

Code
penguins_wo_NAs |> 
  summarise(
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_flipper_length = sd(flipper_length_mm, na.rm = TRUE),
    flipper_lengths = list(flipper_length_mm))
# A tibble: 1 × 3
  mean_flipper_length sd_flipper_length flipper_lengths
                <dbl>             <dbl> <list>         
1                201.              14.0 <int [333]>    

Lesson 4

Handling missing values in common calculations. If not handled, summarise() would generate NAs. One option is to use na.rm = TRUE.

Code
palmerpenguins::penguins |> 
  summarise(
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_flipper_length = sd(flipper_length_mm, na.rm = TRUE),
    flipper_lengths = list(flipper_length_mm))
# A tibble: 1 × 3
  mean_flipper_length sd_flipper_length flipper_lengths
                <dbl>             <dbl> <list>         
1                201.              14.1 <int [344]>    
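
The effect of na.rm = TRUE is easy to see on a small vector (a base-R sketch; the values are invented):

```r
x <- c(180, 190, NA)

mean(x)               # NA propagates through the calculation
#> [1] NA

mean(x, na.rm = TRUE) # NA is dropped before averaging
#> [1] 185
```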

Lesson 5

Repeated calculations for subsets of data. One way to automate repeated calculations is a for() loop. However, this approach is messy and tedious.

Code
for (selected_species in unique(penguins_wo_NAs$species)) {
  penguins_wo_NAs |> 
    filter(species == selected_species) |> 
    summarise(
      mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
      sd_flipper_length = sd(flipper_length_mm, na.rm = TRUE)) |> 
    print()
}
# A tibble: 1 × 2
  mean_flipper_length sd_flipper_length
                <dbl>             <dbl>
1                190.              6.52
# A tibble: 1 × 2
  mean_flipper_length sd_flipper_length
                <dbl>             <dbl>
1                217.              6.59
# A tibble: 1 × 2
  mean_flipper_length sd_flipper_length
                <dbl>            <dbl>
1                196.             7.13

The better alternative is to rely on grouped calculations.

One option, applied here to the data without missing values, is the .by = argument of summarise(), which specifies the subgroups. Be careful not to omit the “.”: by = species would still produce output, but one that is not summarized by species.

Code
penguins_wo_NAs |> 
  summarise(
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_flipper_length = sd(flipper_length_mm, na.rm = TRUE),
    flipper_lengths = list(flipper_length_mm),
    .by = species)
# A tibble: 3 × 4
  species   mean_flipper_length sd_flipper_length flipper_lengths
  <fct>                   <dbl>             <dbl> <list>         
1 Adelie                   190.              6.52 <int [146]>    
2 Gentoo                   217.              6.59 <int [119]>    
3 Chinstrap                196.              7.13 <int [68]>     

A warning about grouping in a chain of piped computations, as shown below: group_by() changes the structure of the data, leaving it grouped by the variables listed in group_by().

Notice the intermediate output indicates # Groups: species [3], i.e. 5 species-island combinations grouped by the 3 species. Adding a second, separate summarise() call then creates a problem. Suppose we want the mean of the mean_flipper_length computed in the previous step, as well as the sd of the previously created sd_flipper_length variable. The end result is an sd_flipper_length column with two NAs.

After the first calculation, summarise() drops only the last group listed in group_by(), namely island; the data remains grouped by species. The result is a 5 × 5 tibble organized into three groups corresponding to the 3 species. Chinstrap and Gentoo each appear on only one island, so taking sd() of a single value yields NA. This is an opportunity for mistakes if one ignores the structure that group_by() imposes on the data: grouped calculations behave differently.

Code
penguins_wo_NAs |> 
  group_by(species, island) |> 
  summarise(
    mean_flipper_length = mean(flipper_length_mm),
    sd_flipper_length = sd(flipper_length_mm),
    flipper_lengths = list(flipper_length_mm))  |> 
  #The next second summarise creates a problem
  summarise(
    mean_flipper_length = mean(mean_flipper_length),
    sd_flipper_length = sd(sd_flipper_length)
  )
# A tibble: 3 × 3
  species   mean_flipper_length sd_flipper_length
  <fct>                   <dbl>             <dbl>
1 Adelie                   190.             0.255
2 Chinstrap                196.            NA    
3 Gentoo                   217.            NA    

To remove the underlying grouping structure, use the option .groups = 'drop' in the summarise() call. This returns an ungrouped tibble of 5 rows by 5 columns.

Code
penguins_wo_NAs |> 
  group_by(species, island) |> 
  summarise(
    mean_flipper_length = mean(flipper_length_mm),
    sd_flipper_length = sd(flipper_length_mm),
    flipper_lengths = list(flipper_length_mm),
    .groups = 'drop')
# A tibble: 5 × 5
  species   island    mean_flipper_length sd_flipper_length flipper_lengths
  <fct>     <fct>                   <dbl>             <dbl> <list>         
1 Adelie    Biscoe                   189.              6.73 <int [44]>     
2 Adelie    Dream                    190.              6.48 <int [55]>     
3 Adelie    Torgersen                192.              6.22 <int [47]>     
4 Chinstrap Dream                    196.              7.13 <int [68]>     
5 Gentoo    Biscoe                   217.              6.59 <int [119]>    

An alternative to .groups = "drop" is the argument .by = c(species, island), which also returns an ungrouped result. Notice that in this case the .by argument replaces the group_by() call entirely.

Code
# Notice: we don't use the group_by() function

penguins_wo_NAs |> 
  summarise(
    mean_flipper_length = mean(flipper_length_mm),
    sd_flipper_length = sd(flipper_length_mm),
    flipper_lengths = list(flipper_length_mm),
    .by = c(species, island))
# A tibble: 5 × 5
  species   island    mean_flipper_length sd_flipper_length flipper_lengths
  <fct>     <fct>                   <dbl>             <dbl> <list>         
1 Adelie    Torgersen                192.              6.22 <int [47]>     
2 Adelie    Biscoe                   189.              6.73 <int [44]>     
3 Adelie    Dream                    190.              6.48 <int [55]>     
4 Gentoo    Biscoe                   217.              6.59 <int [119]>    
5 Chinstrap Dream                    196.              7.13 <int [68]>     

Lesson 6

The unique() function

Using the .by = argument with mutate() can produce a long list of means, in this case one mean per row, computed by species. Notice the mean column appears to show only a single value: the first ten rows printed all belong to the Adelie species, so only its mean (38.8) is visible.

Code
palmerpenguins::penguins |> 
  mutate(
    mean_column = mean(bill_length_mm, na.rm = TRUE),
    .by = species) |> 
  select(bill_length_mm, flipper_length_mm, mean_column)
# A tibble: 344 × 3
   bill_length_mm flipper_length_mm mean_column
            <dbl>             <int>       <dbl>
 1           39.1               181        38.8
 2           39.5               186        38.8
 3           40.3               195        38.8
 4           NA                  NA        38.8
 5           36.7               193        38.8
 6           39.3               190        38.8
 7           38.9               181        38.8
 8           39.2               195        38.8
 9           34.1               193        38.8
10           42                 190        38.8
# ℹ 334 more rows

In order to make sense of the listing, let's select just the species column and the mean column.

Code
palmerpenguins::penguins |> 
  mutate(
    mean_column = mean(bill_length_mm, na.rm = TRUE),
    .by = species) |> 
  select(species, mean_column)
# A tibble: 344 × 2
   species mean_column
   <fct>         <dbl>
 1 Adelie         38.8
 2 Adelie         38.8
 3 Adelie         38.8
 4 Adelie         38.8
 5 Adelie         38.8
 6 Adelie         38.8
 7 Adelie         38.8
 8 Adelie         38.8
 9 Adelie         38.8
10 Adelie         38.8
# ℹ 334 more rows

To avoid a long list of 344 means by species, use the unique() function. unique() eliminates duplicates: the mean was repeated many times within each species, and unique() collapses those repeats, reporting one row per species.

.by = species forces the calculation to run separately for each level of the referenced column, in this case the 3 penguin species. unique(), on the other hand, removes the duplicated species/mean rows.

Code
palmerpenguins::penguins |> 
  mutate(
    mean_column = mean(bill_length_mm, na.rm = TRUE),
    .by = species) |>
  select(species,mean_column)|>
  unique()
# A tibble: 3 × 2
  species   mean_column
  <fct>           <dbl>
1 Adelie           38.8
2 Gentoo           47.5
3 Chinstrap        48.8
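
As a base-R aside, unique() is generic and also works on whole data frames, dropping fully duplicated rows; a small sketch with invented values:

```r
# unique() on a data frame keeps only distinct rows
df <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo"),
  mean_column = c(38.8, 38.8, 47.5)
)

unique(df)  # keeps one Adelie row and the Gentoo row
```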

Centering a variable by its mean

Centering a variable with the scale() function is an ideal situation for the .by = argument inside mutate(), rather than summarise(): instead of reporting one mean per group as we did before, scale() transforms every value. scale() is a generic R function whose default method centers and/or scales the columns of a numeric matrix.

Notice that we are centering, or scaling, each penguin species' bill length around its own mean. Transparency is added with alpha = 0.5.

Centering is a popular preprocessing technique, for instance in machine learning. Here it places the three species' distributions around a common zero point, which allows comparisons across species, i.e. group-wise scaling.
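
What scale() does to a single vector can be seen on toy numbers (a base-R sketch; by default scale() both centers by the mean and divides by the standard deviation):

```r
x <- c(2, 4, 6)

# Center only: subtract the mean (4), leave the spread unchanged
as.numeric(scale(x, scale = FALSE))
#> [1] -2  0  2

# Center and scale: subtract the mean, then divide by the sd (2)
as.numeric(scale(x))
#> [1] -1  0  1
```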

Code
penguins |> 
  filter(!is.na(sex)) |> 
  mutate(bill_length_mm = scale(bill_length_mm), 
         .by = species) |> 
  ggplot() +
  geom_density(aes(x = bill_length_mm, 
                   fill = species),
               alpha = 0.5)

Without centering, you can see that the three distributions sit on their original scale: the bill lengths vary from roughly 30 to 60 mm, so each species' distribution is located differently.

Code
palmerpenguins::penguins |> 
  # mutate(
  #   bill_length_mm = scale(bill_length_mm),
  #   .by = species) |>
  ggplot() +
  geom_density(aes(x = bill_length_mm, fill = species),
               alpha = 0.5) 

Another option is to center across all species at once instead of within each species. As shown below, omitting .by produces a graph without a common center: the overall mean is subtracted, but each species' distribution still sits in its own location. This is why, if you want to make group comparisons, you need the .by option, which gives every group a common center, or zero point.

Code
penguins |> 
  filter(!is.na(sex)) |> 
  mutate(bill_length_mm = scale(bill_length_mm), 
         # .by = species
         ) |>
  ggplot() +
  geom_density(aes(x = bill_length_mm, 
                   fill = species),
               alpha = 0.5)

Lesson 7

Missing data with group calculations

Notice that not all combinations of species and island are displayed. For instance, the Chinstrap-Biscoe, Chinstrap-Torgersen, Gentoo-Dream, and Gentoo-Torgersen combinations are missing.

To be reviewed later on - Last review November 15, 2024

Code
penguins |> 
  summarise(
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_flipper_length = sd(flipper_length_mm, na.rm = TRUE),
    flipper_lengths = list(flipper_length_mm),
    .by = c(species,island) ) 
# A tibble: 5 × 5
  species   island    mean_flipper_length sd_flipper_length flipper_lengths
  <fct>     <fct>                   <dbl>             <dbl> <list>         
1 Adelie    Torgersen                191.              6.23 <int [52]>     
2 Adelie    Biscoe                   189.              6.73 <int [44]>     
3 Adelie    Dream                    190.              6.59 <int [56]>     
4 Gentoo    Biscoe                   217.              6.48 <int [124]>    
5 Chinstrap Dream                    196.              7.13 <int [68]>     

To display all combinations of species and island, even those with no observations, we can use the complete() function from tidyr.

complete() displays all combinations among the listed variables, even where data is missing. Its fill argument also lets you replace the default NA with another value, say 0.

Code
penguins_wo_NAs |> 
  summarise(
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_flipper_length = sd(flipper_length_mm, na.rm = TRUE),
    flipper_lengths = list(flipper_length_mm),
    .by = c(species,island) ) |> 
  complete(
    species, island,
    fill = list(mean_flipper_length = 0,
                sd_flipper_length = 0)
  )
# A tibble: 9 × 5
  species   island    mean_flipper_length sd_flipper_length flipper_lengths
  <fct>     <fct>                   <dbl>             <dbl> <list>         
1 Adelie    Biscoe                   189.              6.73 <int [44]>     
2 Adelie    Dream                    190.              6.48 <int [55]>     
3 Adelie    Torgersen                192.              6.22 <int [47]>     
4 Chinstrap Biscoe                     0               0    <NULL>         
5 Chinstrap Dream                    196.              7.13 <int [68]>     
6 Chinstrap Torgersen                  0               0    <NULL>         
7 Gentoo    Biscoe                   217.              6.59 <int [119]>    
8 Gentoo    Dream                      0               0    <NULL>         
9 Gentoo    Torgersen                  0               0    <NULL>         

Lesson 8

Using across() + summarise() instead of repetitions.

Original summarise

Code
penguins_wo_NAs |> 
  summarise(
    mean_bill_depth = mean(bill_depth_mm),
    mean_bill_length = mean(bill_length_mm),
    mean_flipper_length = mean(flipper_length_mm),
    mean_body_mass = mean(body_mass_g))
# A tibble: 1 × 4
  mean_bill_depth mean_bill_length mean_flipper_length mean_body_mass
            <dbl>            <dbl>               <dbl>          <dbl>
1            17.2             44.0                201.          4207.

Using the across() function, with several variations.

The across() function helps eliminate repetition.

We also use a glue specification to affix a name to the output columns, via .names = 'mean_{.col}'. This glue string prefixes mean_ to every column referenced in .cols.

Without it, the output would not display the mean_ prefix: by default the columns keep their original names. If you want the prefix, you need the .names option, which is a glue specification.

Code
library(glue)

#Long option
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = mean,
    .names = 'mean_{.col}' #Glueing 
  ))
# A tibble: 1 × 4
  mean_bill_length_mm mean_bill_depth_mm mean_flipper_length_mm mean_body_mass_g
                <dbl>              <dbl>                  <dbl>            <dbl>
1                44.0               17.2                   201.            4207.
Code
#Columns option
penguins_wo_NAs |> 
  summarise(across(.cols = 3:6, 
                   function(x) mean(x, na.rm = TRUE),
                   .names = 'mean_{.col}'))
# A tibble: 1 × 4
  mean_bill_length_mm mean_bill_depth_mm mean_flipper_length_mm mean_body_mass_g
                <dbl>              <dbl>                  <dbl>            <dbl>
1                44.0               17.2                   201.            4207.
Code
#where(is.numeric)
penguins_wo_NAs |> 
  summarise(across(where(is.numeric), 
                   function(x) mean(x, na.rm = TRUE),
                   .names = 'mean_{.col}') )
# A tibble: 1 × 5
  mean_bill_length_mm mean_bill_depth_mm mean_flipper_length_mm mean_body_mass_g
                <dbl>              <dbl>                  <dbl>            <dbl>
1                44.0               17.2                   201.            4207.
# ℹ 1 more variable: mean_year <dbl>
Code
#Selecting columns with specific names
penguins_wo_NAs |> 
  summarise(across(contains('bill'), 
                   function(x) mean(x, na.rm = TRUE),
                   .names = 'mean_{.col}') )
# A tibble: 1 × 2
  mean_bill_length_mm mean_bill_depth_mm
                <dbl>              <dbl>
1                44.0               17.2
Code
#Selecting columns with specific names
penguins_wo_NAs |> 
  summarise(across(starts_with('flipper'), 
                   function(x) mean(x, na.rm = TRUE),
                   .names = 'mean_{.col}') )
# A tibble: 1 × 1
  mean_flipper_length_mm
                   <dbl>
1                   201.

Removing unwanted strings

Note that we can remove unwanted suffixes such as _mm and _g from the generated names.

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = mean,
    .names = 'mean_{.col |> 
    str_remove("_mm") |> 
    str_remove("_g")}'
  ))
# A tibble: 1 × 4
  mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass
             <dbl>           <dbl>               <dbl>          <dbl>
1             44.0            17.2                201.          4207.

Another option is to include {.fn} in the .names specification. Unfortunately, on its own it does not do the trick: the prefix mean_ was replaced by 1_, because a bare function passed to .fns has no name, so {.fn} falls back to the function's position (1).

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = mean,
    .names = '{.fn}_{.col |> str_remove("_mm") |> str_remove("_g")}'
  ))
# A tibble: 1 × 4
  `1_bill_length` `1_bill_depth` `1_flipper_length` `1_body_mass`
            <dbl>          <dbl>              <dbl>         <dbl>
1            44.0           17.2               201.         4207.

To make it work correctly, we need to pass a named list to .fns, for instance .fns = list(avg = mean). This is the best option for displaying the names of the summary statistics called for: the list() gives a name to the function, and {.fn} picks that name up.

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = list(avg = mean), #list gives a name to the function 
    .names = '{.fn}_{.col |> str_remove("_mm") |> str_remove("_g")}'
  ))
# A tibble: 1 × 4
  avg_bill_length avg_bill_depth avg_flipper_length avg_body_mass
            <dbl>          <dbl>              <dbl>         <dbl>
1            44.0           17.2               201.         4207.

You can compute more summary statistics by adding a standard deviation to the .fns list.

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = list(avg = mean, standard_deviation = sd),
    .names = '{.fn}_{.col |> str_remove("_mm") |> str_remove("_g")}'
  ))
# A tibble: 1 × 8
  avg_bill_length standard_deviation_bil…¹ avg_bill_depth standard_deviation_b…²
            <dbl>                    <dbl>          <dbl>                  <dbl>
1            44.0                     5.47           17.2                   1.97
# ℹ abbreviated names: ¹​standard_deviation_bill_length,
#   ²​standard_deviation_bill_depth
# ℹ 4 more variables: avg_flipper_length <dbl>,
#   standard_deviation_flipper_length <dbl>, avg_body_mass <dbl>,
#   standard_deviation_body_mass <dbl>

Lesson 9

Using across() + mutate( ) instead of repetitions

Using across() inside mutate() facilitates cleaning multiple columns in one shot. In this example, we use across() along with parse_number() to extract the first number inside each text value.

We get a warning, but the numbers are parsed, i.e. extracted from the strings. The end result is a tibble of numbers instead of characters.

parse_number() extracts the first number from any string that has numbers embedded in it.

Code
tibble(
  text1 = c("I have 10 apples", "I have 20 appples"),
  text2 = c("I have 30 apples", "I have 40 apples")
) |> 
  mutate(
    across(
      .cols = everything(),
      .fns = parse_number
    )
  )
# A tibble: 2 × 2
  text1 text2
  <dbl> <dbl>
1    10    30
2    20    40

If you want to keep both the original text and the parsed numbers, add new columns via the .names argument as follows:

Code
tibble(
  text1 = c("I have 10 apples", "I have 20 appples"),
  text2 = c("I have 30 apples", "I have 40 apples")
) |> 
  mutate(
    across(
      .cols = everything(),
      .fns = parse_number,
      .names = "number_{.col}"
    )
  )
# A tibble: 2 × 4
  text1            text2            number_text1 number_text2
  <chr>            <chr>                   <dbl>        <dbl>
1 I have 10 apples I have 30 apples           10           30
2 I have 20 apples I have 40 apples           20           40

Lesson 10

Adding calculations next to across()

Recapping a prior example below. The operation listed inside summarise() is just one of several calculations that could be added. In the .names specification, str_remove() strips the unit suffix from each column name.

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = mean,
    .names = 'mean_{.col |> str_remove("_mm")}'
  ))
# A tibble: 1 × 4
  mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass_g
             <dbl>           <dbl>               <dbl>            <dbl>
1             44.0            17.2                201.            4207.
Code
#Or, you can glue a label onto the column name via a named list
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = list(Avg = mean),
    .names = '{.fn}_{.col |> str_remove("_mm")}'
  ))
# A tibble: 1 × 4
  Avg_bill_length Avg_bill_depth Avg_flipper_length Avg_body_mass_g
            <dbl>          <dbl>              <dbl>           <dbl>
1            44.0           17.2               201.           4207.

You can add more operations to summarise(). Say you want the standard deviation of bill_length_mm alongside the means: all you need to do is add one more line.

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = mean,
    .names = 'mean_{.col |> str_remove("_mm")}'
  ),
  sd_bill_length = sd(bill_length_mm)
  )
# A tibble: 1 × 5
  mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass_g
             <dbl>           <dbl>               <dbl>            <dbl>
1             44.0            17.2                201.            4207.
# ℹ 1 more variable: sd_bill_length <dbl>

Lesson 11

Grouped calculations with across()

In the last lesson we learned that across() not only handles multiple computations but also combines with other calculations. It works on a group basis as well, via the .by argument, so we can perform powerful estimations concisely.

In the example below, we estimate the means by species.

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = mean,
    .names = 'mean_{.col |> str_remove("_mm")}'
  ),
  .by = species
  )
# A tibble: 3 × 5
  species  mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass_g
  <fct>               <dbl>           <dbl>               <dbl>            <dbl>
1 Adelie               38.8            18.3                190.            3706.
2 Gentoo               47.6            15.0                217.            5092.
3 Chinstr…             48.8            18.4                196.            3733.

We can extend the grouping to species by island:

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = mean,
    .names = 'mean_{.col |> str_remove("_mm")}'
  ),
  .by = c(species,island)
  )
# A tibble: 5 × 6
  species   island    mean_bill_length mean_bill_depth mean_flipper_length
  <fct>     <fct>                <dbl>           <dbl>               <dbl>
1 Adelie    Torgersen             39.0            18.5                192.
2 Adelie    Biscoe                39.0            18.4                189.
3 Adelie    Dream                 38.5            18.2                190.
4 Gentoo    Biscoe                47.6            15.0                217.
5 Chinstrap Dream                 48.8            18.4                196.
# ℹ 1 more variable: mean_body_mass_g <dbl>

An extension of this idea is to add another variable outside across(), in this case estimating the standard deviation of bill_length_mm.

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    .fns = list(Average = mean),
    .names = '{.fn}_{.col |> str_remove("_mm") |> str_remove("_g")}'
  ),
  .by = c(species,island),
  std_bill_length = sd(bill_length_mm)
  )
# A tibble: 5 × 7
  species   island Average_bill_length Average_bill_depth Average_flipper_length
  <fct>     <fct>                <dbl>              <dbl>                  <dbl>
1 Adelie    Torge…                39.0               18.5                   192.
2 Adelie    Biscoe                39.0               18.4                   189.
3 Adelie    Dream                 38.5               18.2                   190.
4 Gentoo    Biscoe                47.6               15.0                   217.
5 Chinstrap Dream                 48.8               18.4                   196.
# ℹ 2 more variables: Average_body_mass <dbl>, std_bill_length <dbl>

Lesson 12

Getting to know the reframe() function

reframe() sits between mutate() and summarise(): summarise() returns one row per group and mutate() returns one row per input row, whereas reframe() can return an arbitrary number of rows.

The range() function is relevant here: it returns the minimum and maximum values of a column, i.e. a vector of length two.

If we use range() within summarise(), we get an error message asking us to use reframe() instead.

In short, reframe() is a highly flexible function that offers much of the flexibility of summarise(): it can generate labels, group by other variables, and accommodate across() with all its options.
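The row-count contrast can be sketched with a toy tibble (a minimal illustration, not part of the lesson):

```r
library(dplyr)

df <- tibble(x = c(1, 5, 9))

df |> summarise(m = mean(x))  # always 1 row
df |> mutate(m = mean(x))     # always as many rows as the input (3 here)
df |> reframe(r = range(x))   # any number of rows (2 here: min and max)
```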

Code
range(penguins_wo_NAs$bill_length_mm)
[1] 32.1 59.6
Code
# penguins_wo_NAs |>
#   summarise(range_bill_length_mm = range(bill_length_mm))

#Using reframe() function instead of summarise()
penguins_wo_NAs |> 
  reframe(range_bill_length_mm = range(bill_length_mm))
# A tibble: 2 × 1
  range_bill_length_mm
                 <dbl>
1                 32.1
2                 59.6

Adding labels to the reframe() output via a type column

Since reframe() along with range() returns unnamed rows, we can add a type column that labels each row of the output.

Code
penguins_wo_NAs |> 
  reframe(
    type = c("min", "max"),
    range_bill_length_mm = range(bill_length_mm),
    .by = c(species) )
# A tibble: 6 × 3
  species   type  range_bill_length_mm
  <fct>     <chr>                <dbl>
1 Adelie    min                   32.1
2 Adelie    max                   46  
3 Gentoo    min                   40.9
4 Gentoo    max                   59.6
5 Chinstrap min                   40.9
6 Chinstrap max                   58  

Grouping with reframe()

We can do more with reframe(): it supports grouped statistics via the .by argument as well.

Code
penguins_wo_NAs |> 
  reframe(
    type = c("min", "max"),
    range_bill_length_mm = range(bill_length_mm),
    .by = species)
# A tibble: 6 × 3
  species   type  range_bill_length_mm
  <fct>     <chr>                <dbl>
1 Adelie    min                   32.1
2 Adelie    max                   46  
3 Gentoo    min                   40.9
4 Gentoo    max                   59.6
5 Chinstrap min                   40.9
6 Chinstrap max                   58  

We can also request range() for several columns at once by inserting across() within the reframe() call and passing range via .fns =.

Code
penguins_wo_NAs |> 
  reframe(
    type = c("min", "max"),
    across(.cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
           .fns = range
             ),
    .by = species)
# A tibble: 6 × 6
  species   type  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <chr>          <dbl>         <dbl>             <int>       <int>
1 Adelie    min             32.1          15.5               172        2850
2 Adelie    max             46            21.5               210        4775
3 Gentoo    min             40.9          13.1               203        3950
4 Gentoo    max             59.6          17.3               231        6300
5 Chinstrap min             40.9          16.4               178        2700
6 Chinstrap max             58            20.8               212        4800

Creating tables with reframe()

But we could also format the result as a table. To do so, we call pivot_longer() followed by pivot_wider() before handing the data to gt().

Note: We could have used this option for the CHCI report documenting entering years across cohorts by programs.

Code
library(gt)
library(gtExtras)

#Pivot longer
penguins_wo_NAs |> 
  reframe(
    type = c("min", "max"),
    across(.cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
           .fns = range
             ),
    .by = species) |> 
  pivot_longer(
    cols = -c(1:2),
    names_to = "quantity",
    values_to = 'value'
  ) |> 
  pivot_wider(
    id_cols = c(quantity),
    names_from = c(species,type),
    values_from = value
  ) |> 
  gt()
quantity Adelie_min Adelie_max Gentoo_min Gentoo_max Chinstrap_min Chinstrap_max
bill_length_mm 32.1 46.0 40.9 59.6 40.9 58.0
bill_depth_mm 15.5 21.5 13.1 17.3 16.4 20.8
flipper_length_mm 172.0 210.0 203.0 231.0 178.0 212.0
body_mass_g 2850.0 4775.0 3950.0 6300.0 2700.0 4800.0

Lesson 13

Motivation for using tidyselect helpers

Tidyselect helpers are powerful techniques that let you select the right columns or variables programmatically. In the example below, a named list in .fns = labels the summary function, and that label is glued onto the column name via the .names = argument:

Code
penguins_wo_NAs |> 
  summarise(
    across(
    .cols = c("bill_length_mm",
              "bill_depth_mm",
              "flipper_length_mm",
              "body_mass_g"),
    #Adding labels to the summary statistics
    .fns = list(avg = mean),
    #Glueing summary stat labels & column names 
    .names = '{.fn}_{.col |> str_remove("_mm")}'),
  .by = c(species,island))
# A tibble: 5 × 6
  species   island    avg_bill_length avg_bill_depth avg_flipper_length
  <fct>     <fct>               <dbl>          <dbl>              <dbl>
1 Adelie    Torgersen            39.0           18.5               192.
2 Adelie    Biscoe               39.0           18.4               189.
3 Adelie    Dream                38.5           18.2               190.
4 Gentoo    Biscoe               47.6           15.0               217.
5 Chinstrap Dream                48.8           18.4               196.
# ℹ 1 more variable: avg_body_mass_g <dbl>

Instead of listing the columns on which the summary statistics are to be computed, we can use tidyselect helpers to locate them. The helper used here is where(). It works in combination with a predicate function, is.numeric(), that checks each column for the desired property.

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = where(is.numeric),
    .fns = list(avg = mean),
    .names = '{.fn}_{.col |> str_remove("_mm")}'
  ),
  .by = c(species,island))
# A tibble: 5 × 7
  species   island    avg_bill_length avg_bill_depth avg_flipper_length
  <fct>     <fct>               <dbl>          <dbl>              <dbl>
1 Adelie    Torgersen            39.0           18.5               192.
2 Adelie    Biscoe               39.0           18.4               189.
3 Adelie    Dream                38.5           18.2               190.
4 Gentoo    Biscoe               47.6           15.0               217.
5 Chinstrap Dream                48.8           18.4               196.
# ℹ 2 more variables: avg_body_mass_g <dbl>, avg_year <dbl>

Notice this omnibus statement includes the column year, which, although stored as a number, is effectively categorical. To exclude this kind of variable we can add a minus sign, wrapping everything in c(): the selection c(where(is.numeric), -year) combines the tidyselect helper with the exclusion of year.

Code
penguins_wo_NAs |> 
  summarise(across(
    .cols = c(where(is.numeric), -year),
    .fns = list(avg = mean),
    .names = '{.fn}_{.col |> str_remove("_mm")}'
  ),
  .by = c(species)
  )
# A tibble: 3 × 5
  species   avg_bill_length avg_bill_depth avg_flipper_length avg_body_mass_g
  <fct>               <dbl>          <dbl>              <dbl>           <dbl>
1 Adelie               38.8           18.3               190.           3706.
2 Gentoo               47.6           15.0               217.           5092.
3 Chinstrap            48.8           18.4               196.           3733.

Lesson 14

Using anonymous functions

We can also pass custom functions to across(). Using the original penguins dataset, which still contains NAs, applying mean() directly returns missing values.

To avoid ending up with NAs, we can use mean() with the na.rm option. The way to do so is to define a new function that sets na.rm = TRUE, written as an anonymous function(x).

There are three options for writing anonymous functions:

Code
#Option 1: the classic function(x) syntax, removing NAs within mean()
palmerpenguins::penguins |> 
  summarise(across(
    .cols = c(where(is.numeric), -year),
    .fns = function(x) mean(x, na.rm = TRUE),
    .names = 'mean_{.col |> str_remove("_mm")}'
  ),
  .by = c(species,island))
# A tibble: 5 × 6
  species   island    mean_bill_length mean_bill_depth mean_flipper_length
  <fct>     <fct>                <dbl>           <dbl>               <dbl>
1 Adelie    Torgersen             39.0            18.4                191.
2 Adelie    Biscoe                39.0            18.4                189.
3 Adelie    Dream                 38.5            18.3                190.
4 Gentoo    Biscoe                47.5            15.0                217.
5 Chinstrap Dream                 48.8            18.4                196.
# ℹ 1 more variable: mean_body_mass_g <dbl>
Code
#Option 2: the shorthand lambda syntax \(x)
palmerpenguins::penguins |> 
  summarise(across(
    .cols = c(where(is.numeric), -year),
    .fns = \(x) mean(x, na.rm = TRUE),
    .names = 'mean_{.col |> str_remove("_mm")}'
  ),
  .by = c(species) )
# A tibble: 3 × 5
  species  mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass_g
  <fct>               <dbl>           <dbl>               <dbl>            <dbl>
1 Adelie               38.8            18.3                190.            3701.
2 Gentoo               47.5            15.0                217.            5076.
3 Chinstr…             48.8            18.4                196.            3733.
Code
#Option 3: naming the function in a list to create a prefix for the new variables

palmerpenguins::penguins |> 
  summarise(across(
    .cols = c(where(is.numeric), -year),
    .fns = list(avg = function(x) mean(x, na.rm = TRUE)),
    .names = '{.fn}_{.col |> str_remove("_mm")}'
  ),
  .by = c(species,island))
# A tibble: 5 × 6
  species   island    avg_bill_length avg_bill_depth avg_flipper_length
  <fct>     <fct>               <dbl>          <dbl>              <dbl>
1 Adelie    Torgersen            39.0           18.4               191.
2 Adelie    Biscoe               39.0           18.4               189.
3 Adelie    Dream                38.5           18.3               190.
4 Gentoo    Biscoe               47.5           15.0               217.
5 Chinstrap Dream                48.8           18.4               196.
# ℹ 1 more variable: avg_body_mass_g <dbl>

Lesson 15

This lesson shows the most important predicate functions to pair with the tidyselect helper where(); namely, is.numeric, is.factor, and is.character.

Selecting the right data by column type: select() combined with where(is.numeric) screens the dataset for numeric variables. We can also drop specific columns from the selection; in the example below, we exclude year.

The where() function can be used in several scenarios: selecting numeric variables, selecting factor variables, or selecting both factor and character variables.

Using where() to select variables by type (numeric, factor, character)

In this example we only want numeric variables, excluding year: although coded as a number, it is in the end categorical.

Code
#Scenario 1: selecting numeric variables while excluding the year column

penguins_wo_NAs |> 
  select(where(is.numeric), -year) 
# A tibble: 333 × 4
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
            <dbl>         <dbl>             <int>       <int>
 1           39.1          18.7               181        3750
 2           39.5          17.4               186        3800
 3           40.3          18                 195        3250
 4           36.7          19.3               193        3450
 5           39.3          20.6               190        3650
 6           38.9          17.8               181        3625
 7           39.2          19.6               195        4675
 8           41.1          17.6               182        3200
 9           38.6          21.2               191        3800
10           34.6          21.1               198        4400
# ℹ 323 more rows
Code
#Scenario 2: selecting variables that are factors

penguins_wo_NAs |> 
  select(where(is.factor))
# A tibble: 333 × 3
   species island    sex   
   <fct>   <fct>     <fct> 
 1 Adelie  Torgersen male  
 2 Adelie  Torgersen female
 3 Adelie  Torgersen female
 4 Adelie  Torgersen female
 5 Adelie  Torgersen male  
 6 Adelie  Torgersen female
 7 Adelie  Torgersen male  
 8 Adelie  Torgersen female
 9 Adelie  Torgersen male  
10 Adelie  Torgersen male  
# ℹ 323 more rows
Code
# Scenario 3: selecting factor and character variables

penguins_wo_NAs |> 
  select(where(is.factor), where(is.character))
# A tibble: 333 × 3
   species island    sex   
   <fct>   <fct>     <fct> 
 1 Adelie  Torgersen male  
 2 Adelie  Torgersen female
 3 Adelie  Torgersen female
 4 Adelie  Torgersen female
 5 Adelie  Torgersen male  
 6 Adelie  Torgersen female
 7 Adelie  Torgersen male  
 8 Adelie  Torgersen female
 9 Adelie  Torgersen male  
10 Adelie  Torgersen male  
# ℹ 323 more rows

Using anonymous functions inside where()

Alternatively, you can define a custom function: inside where() you can use an anonymous function that combines the two checks with the OR operator ||.

Consequently, the defined function is TRUE if the corresponding column is a factor or a character, and FALSE if it is numeric.

Code
penguins_wo_NAs |> 
  select(where(function(x) is.factor(x) || is.character(x)) )
# A tibble: 333 × 3
   species island    sex   
   <fct>   <fct>     <fct> 
 1 Adelie  Torgersen male  
 2 Adelie  Torgersen female
 3 Adelie  Torgersen female
 4 Adelie  Torgersen female
 5 Adelie  Torgersen male  
 6 Adelie  Torgersen female
 7 Adelie  Torgersen male  
 8 Adelie  Torgersen female
 9 Adelie  Torgersen male  
10 Adelie  Torgersen male  
# ℹ 323 more rows

Anonymous functions and logical conditions for selecting variables

We define a custom function to identify numeric columns and add another condition, such as requiring the mean to be above a particular value.

So here we define an anonymous function that checks whether a variable is numeric and whether its mean is above 200.

The difference between the double && and the single & operator is that & always evaluates both sides, so mean() also gets called on non-numeric columns and generates the warning "argument is not numeric or logical". The double && short-circuits: when is.numeric(x) is FALSE, mean(x) is never evaluated.
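The short-circuit behaviour can be checked directly (a minimal sketch; the helper name is made up for illustration):

```r
# && stops at the first FALSE, so mean() never sees the factor
is_big_numeric <- function(x) is.numeric(x) && mean(x) > 200

is_big_numeric(factor(c("a", "b")))  # FALSE, no warning
is_big_numeric(c(150, 300))          # TRUE (mean is 225)

# With a single &, mean() would be evaluated on the factor as well,
# producing the "argument is not numeric or logical" warning.
```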

Code
penguins_wo_NAs |> 
  select(where(function(x) is.numeric(x) && mean(x) > 200 ) )
# A tibble: 333 × 3
   flipper_length_mm body_mass_g  year
               <int>       <int> <int>
 1               181        3750  2007
 2               186        3800  2007
 3               195        3250  2007
 4               193        3450  2007
 5               190        3650  2007
 6               181        3625  2007
 7               195        4675  2007
 8               182        3200  2007
 9               191        3800  2007
10               198        4400  2007
# ℹ 323 more rows

Anonymous function to identify NAs

Another handy option is to select columns where a specific percentage of values is missing, in this case variables with more than 1% missing cases.

It is usually practical to name this function for future use.

Code
palmerpenguins::penguins |> 
  select(where(function(x) mean(is.na(x)) > 0.01) )
# A tibble: 344 × 1
   sex   
   <fct> 
 1 male  
 2 female
 3 female
 4 <NA>  
 5 female
 6 male  
 7 female
 8 male  
 9 <NA>  
10 <NA>  
# ℹ 334 more rows

Creating function to identify variables with missing cases

Sometimes it is better to save a function under a name that tells what it does, for future use. Here we save a function that identifies columns with more than 1% missing cases and call it more_than_one_percent_missing.

Code
#Creating function
more_than_one_percent_missing <- function(x) mean(is.na(x)) > 0.01

#Using function

palmerpenguins::penguins |> 
  select(where(more_than_one_percent_missing ))
# A tibble: 344 × 1
   sex   
   <fct> 
 1 male  
 2 female
 3 female
 4 <NA>  
 5 female
 6 male  
 7 female
 8 male  
 9 <NA>  
10 <NA>  
# ℹ 334 more rows

Lesson 16

This lesson further examines tidyselect helpers for selecting columns based on names or strings embedded in the column names.

  1. Selecting the right data by column name.

  2. Selecting consecutive columns, e.g. from bill_length_mm to sex.

  3. Selecting columns ending with a particular string, e.g. variables ending with _mm.

  4. Selecting columns starting with a particular string, e.g. variables starting with bill.

  5. Selecting columns containing a particular string, e.g. variables containing length.

Selecting consecutive columns

Selecting columns from bill_length_mm to sex.

Code
#Selecting consecutive columns
penguins_wo_NAs |> 
  select(bill_length_mm:sex)
# A tibble: 333 × 5
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
            <dbl>         <dbl>             <int>       <int> <fct> 
 1           39.1          18.7               181        3750 male  
 2           39.5          17.4               186        3800 female
 3           40.3          18                 195        3250 female
 4           36.7          19.3               193        3450 female
 5           39.3          20.6               190        3650 male  
 6           38.9          17.8               181        3625 female
 7           39.2          19.6               195        4675 male  
 8           41.1          17.6               182        3200 female
 9           38.6          21.2               191        3800 male  
10           34.6          21.1               198        4400 male  
# ℹ 323 more rows

Selecting columns ending with a particular string

Code
#Selecting columns ending with _mm
penguins_wo_NAs |> 
  select(ends_with('_mm'))
# A tibble: 333 × 3
   bill_length_mm bill_depth_mm flipper_length_mm
            <dbl>         <dbl>             <int>
 1           39.1          18.7               181
 2           39.5          17.4               186
 3           40.3          18                 195
 4           36.7          19.3               193
 5           39.3          20.6               190
 6           38.9          17.8               181
 7           39.2          19.6               195
 8           41.1          17.6               182
 9           38.6          21.2               191
10           34.6          21.1               198
# ℹ 323 more rows

Selecting columns that start with a particular string

Selecting variables starting with the string bill.

Code
#Selecting columns starting with bill

penguins_wo_NAs |> 
  select(starts_with('bill'))
# A tibble: 333 × 2
   bill_length_mm bill_depth_mm
            <dbl>         <dbl>
 1           39.1          18.7
 2           39.5          17.4
 3           40.3          18  
 4           36.7          19.3
 5           39.3          20.6
 6           38.9          17.8
 7           39.2          19.6
 8           41.1          17.6
 9           38.6          21.2
10           34.6          21.1
# ℹ 323 more rows

Selecting columns that contain a particular string

Code
#Selecting columns containing the string length
penguins_wo_NAs |> 
  select(contains('length'))
# A tibble: 333 × 2
   bill_length_mm flipper_length_mm
            <dbl>             <int>
 1           39.1               181
 2           39.5               186
 3           40.3               195
 4           36.7               193
 5           39.3               190
 6           38.9               181
 7           39.2               195
 8           41.1               182
 9           38.6               191
10           34.6               198
# ℹ 323 more rows

Lesson 17

Bringing it all together.

Using the Ames housing dataset from the modeldata package.

Code
library(modeldata)
library(janitor)
library(skimr)
library(gt)
library(gtExtras)

ames <- modeldata::ames |> 
  janitor::clean_names()


  
#Examining dataset using skimr

ames |> 
  clean_names() |> 
  skim()
Data summary
Name clean_names(ames)
Number of rows 2930
Number of columns 74
_______________________
Column type frequency:
factor 40
numeric 34
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
ms_sub_class 0 1 FALSE 16 One: 1079, Two: 575, One: 287, One: 192
ms_zoning 0 1 FALSE 7 Res: 2273, Res: 462, Flo: 139, Res: 27
street 0 1 FALSE 2 Pav: 2918, Grv: 12
alley 0 1 FALSE 3 No_: 2732, Gra: 120, Pav: 78
lot_shape 0 1 FALSE 4 Reg: 1859, Sli: 979, Mod: 76, Irr: 16
land_contour 0 1 FALSE 4 Lvl: 2633, HLS: 120, Bnk: 117, Low: 60
utilities 0 1 FALSE 3 All: 2927, NoS: 2, NoS: 1
lot_config 0 1 FALSE 5 Ins: 2140, Cor: 511, Cul: 180, FR2: 85
land_slope 0 1 FALSE 3 Gtl: 2789, Mod: 125, Sev: 16
neighborhood 0 1 FALSE 28 Nor: 443, Col: 267, Old: 239, Edw: 194
condition_1 0 1 FALSE 9 Nor: 2522, Fee: 164, Art: 92, RRA: 50
condition_2 0 1 FALSE 8 Nor: 2900, Fee: 13, Art: 5, Pos: 4
bldg_type 0 1 FALSE 5 One: 2425, Twn: 233, Dup: 109, Twn: 101
house_style 0 1 FALSE 8 One: 1481, Two: 873, One: 314, SLv: 128
overall_cond 0 1 FALSE 9 Ave: 1654, Abo: 533, Goo: 390, Ver: 144
roof_style 0 1 FALSE 6 Gab: 2321, Hip: 551, Gam: 22, Fla: 20
roof_matl 0 1 FALSE 8 Com: 2887, Tar: 23, WdS: 9, WdS: 7
exterior_1st 0 1 FALSE 16 Vin: 1026, Met: 450, HdB: 442, Wd : 420
exterior_2nd 0 1 FALSE 17 Vin: 1015, Met: 447, HdB: 406, Wd : 397
mas_vnr_type 0 1 FALSE 5 Non: 1775, Brk: 880, Sto: 249, Brk: 25
exter_cond 0 1 FALSE 5 Typ: 2549, Goo: 299, Fai: 67, Exc: 12
foundation 0 1 FALSE 6 PCo: 1310, CBl: 1244, Brk: 311, Sla: 49
bsmt_cond 0 1 FALSE 6 Typ: 2616, Goo: 122, Fai: 104, No_: 80
bsmt_exposure 0 1 FALSE 5 No: 1906, Av: 418, Gd: 284, Mn: 239
bsmt_fin_type_1 0 1 FALSE 7 GLQ: 859, Unf: 851, ALQ: 429, Rec: 288
bsmt_fin_type_2 0 1 FALSE 7 Unf: 2499, Rec: 106, LwQ: 89, No_: 81
heating 0 1 FALSE 6 Gas: 2885, Gas: 27, Gra: 9, Wal: 6
heating_qc 0 1 FALSE 5 Exc: 1495, Typ: 864, Goo: 476, Fai: 92
central_air 0 1 FALSE 2 Y: 2734, N: 196
electrical 0 1 FALSE 6 SBr: 2682, Fus: 188, Fus: 50, Fus: 8
functional 0 1 FALSE 8 Typ: 2728, Min: 70, Min: 65, Mod: 35
garage_type 0 1 FALSE 7 Att: 1731, Det: 782, Bui: 186, No_: 157
garage_finish 0 1 FALSE 4 Unf: 1231, RFn: 812, Fin: 728, No_: 159
garage_cond 0 1 FALSE 6 Typ: 2665, No_: 159, Fai: 74, Goo: 15
paved_drive 0 1 FALSE 3 Pav: 2652, Dir: 216, Par: 62
pool_qc 0 1 FALSE 5 No_: 2917, Exc: 4, Goo: 4, Typ: 3
fence 0 1 FALSE 5 No_: 2358, Min: 330, Goo: 118, Goo: 112
misc_feature 0 1 FALSE 6 Non: 2824, She: 95, Gar: 5, Oth: 4
sale_type 0 1 FALSE 10 WD : 2536, New: 239, COD: 87, Con: 26
sale_condition 0 1 FALSE 6 Nor: 2413, Par: 245, Abn: 190, Fam: 46

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lot_frontage 0 1 57.65 33.50 0.00 43.00 63.00 78.00 313.00 ▇▇▁▁▁
lot_area 0 1 10147.92 7880.02 1300.00 7440.25 9436.50 11555.25 215245.00 ▇▁▁▁▁
year_built 0 1 1971.36 30.25 1872.00 1954.00 1973.00 2001.00 2010.00 ▁▂▃▆▇
year_remod_add 0 1 1984.27 20.86 1950.00 1965.00 1993.00 2004.00 2010.00 ▅▂▂▃▇
mas_vnr_area 0 1 101.10 178.63 0.00 0.00 0.00 162.75 1600.00 ▇▁▁▁▁
bsmt_fin_sf_1 0 1 4.18 2.23 0.00 3.00 3.00 7.00 7.00 ▃▂▇▁▇
bsmt_fin_sf_2 0 1 49.71 169.14 0.00 0.00 0.00 0.00 1526.00 ▇▁▁▁▁
bsmt_unf_sf 0 1 559.07 439.54 0.00 219.00 465.50 801.75 2336.00 ▇▅▂▁▁
total_bsmt_sf 0 1 1051.26 440.97 0.00 793.00 990.00 1301.50 6110.00 ▇▃▁▁▁
first_flr_sf 0 1 1159.56 391.89 334.00 876.25 1084.00 1384.00 5095.00 ▇▃▁▁▁
second_flr_sf 0 1 335.46 428.40 0.00 0.00 0.00 703.75 2065.00 ▇▃▂▁▁
gr_liv_area 0 1 1499.69 505.51 334.00 1126.00 1442.00 1742.75 5642.00 ▇▇▁▁▁
bsmt_full_bath 0 1 0.43 0.52 0.00 0.00 0.00 1.00 3.00 ▇▆▁▁▁
bsmt_half_bath 0 1 0.06 0.25 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
full_bath 0 1 1.57 0.55 0.00 1.00 2.00 2.00 4.00 ▁▇▇▁▁
half_bath 0 1 0.38 0.50 0.00 0.00 0.00 1.00 2.00 ▇▁▅▁▁
bedroom_abv_gr 0 1 2.85 0.83 0.00 2.00 3.00 3.00 8.00 ▁▇▂▁▁
kitchen_abv_gr 0 1 1.04 0.21 0.00 1.00 1.00 1.00 3.00 ▁▇▁▁▁
tot_rms_abv_grd 0 1 6.44 1.57 2.00 5.00 6.00 7.00 15.00 ▁▇▂▁▁
fireplaces 0 1 0.60 0.65 0.00 0.00 1.00 1.00 4.00 ▇▇▁▁▁
garage_cars 0 1 1.77 0.76 0.00 1.00 2.00 2.00 5.00 ▅▇▂▁▁
garage_area 0 1 472.66 215.19 0.00 320.00 480.00 576.00 1488.00 ▃▇▃▁▁
wood_deck_sf 0 1 93.75 126.36 0.00 0.00 0.00 168.00 1424.00 ▇▁▁▁▁
open_porch_sf 0 1 47.53 67.48 0.00 0.00 27.00 70.00 742.00 ▇▁▁▁▁
enclosed_porch 0 1 23.01 64.14 0.00 0.00 0.00 0.00 1012.00 ▇▁▁▁▁
three_season_porch 0 1 2.59 25.14 0.00 0.00 0.00 0.00 508.00 ▇▁▁▁▁
screen_porch 0 1 16.00 56.09 0.00 0.00 0.00 0.00 576.00 ▇▁▁▁▁
pool_area 0 1 2.24 35.60 0.00 0.00 0.00 0.00 800.00 ▇▁▁▁▁
misc_val 0 1 50.64 566.34 0.00 0.00 0.00 0.00 17000.00 ▇▁▁▁▁
mo_sold 0 1 6.22 2.71 1.00 4.00 6.00 8.00 12.00 ▅▆▇▃▃
year_sold 0 1 2007.79 1.32 2006.00 2007.00 2008.00 2009.00 2010.00 ▇▇▇▇▃
sale_price 0 1 180796.06 79886.69 12789.00 129500.00 160000.00 213500.00 755000.00 ▇▇▁▁▁
longitude 0 1 -93.64 0.03 -93.69 -93.66 -93.64 -93.62 -93.58 ▅▅▇▆▁
latitude 0 1 42.03 0.02 41.99 42.02 42.03 42.05 42.06 ▂▂▇▇▇
Code
#Examining a couple of numeric variables using gt_plt_summary
ames |> 
  clean_names() |> 
  select(sale_price,garage_cars) |> 
  gt_plt_summary()
select(clean_names(ames), sale_price, garage_cars)
2930 rows x 2 cols
Column Plot Overview Missing Mean Median SD
sale_price 13K–755K 0.0% 180,796.1 160,000.0 79,886.7
garage_cars 0–5 0.0% 1.8 2.0 0.8

Transforming columns as tibble

Focusing our attention on numeric variables: we can store the numeric columns of ames in their own tibble by selecting them with where(is.numeric).

Code
# A tibble: 34 × 12
   skim_type skim_variable  n_missing complete_rate numeric.mean numeric.sd
   <chr>     <chr>              <int>         <dbl>        <dbl>      <dbl>
 1 numeric   lot_frontage           0             1        57.6       33.5 
 2 numeric   lot_area               0             1     10148.      7880.  
 3 numeric   year_built             0             1      1971.        30.2 
 4 numeric   year_remod_add         0             1      1984.        20.9 
 5 numeric   mas_vnr_area           0             1       101.       179.  
 6 numeric   bsmt_fin_sf_1          0             1         4.18       2.23
 7 numeric   bsmt_fin_sf_2          0             1        49.7      169.  
 8 numeric   bsmt_unf_sf            0             1       559.       440.  
 9 numeric   total_bsmt_sf          0             1      1051.       441.  
10 numeric   first_flr_sf           0             1      1160.       392.  
# ℹ 24 more rows
# ℹ 6 more variables: numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>,
#   numeric.p75 <dbl>, numeric.p100 <dbl>, numeric.hist <chr>
Code
ames_numeric <- ames |> 
  select(where(is.numeric)) |> 
  as_tibble()

Next, we compute summary statistics focusing on variables that are counts. To do so, we define an anonymous function that checks whether all of a variable's values fall within 0 to 10. The expression x %in% 0:10 returns a vector of TRUE/FALSE values, and wrapping it in all() checks whether every value lies in that range. Notice we use base R's all(), not the tidyselect helper all_of().

We can use this same idea to identify continuous variables. Applied to ames_numeric, the anonymous function identified 9 count variables. To identify continuous variables instead, negate the condition: !all(x %in% 0:10). This revised anonymous function yielded 25 continuous variables.

Code
library(modeldata)
library(janitor)
library(skimr)

#Selecting count variables whose values range from 0 to 10

ames_numeric |> 
  #Selecting columns with integers from 0 to 10
  select(where(function(x) all(x %in% 0:10) ) )
# A tibble: 2,930 × 9
   bsmt_fin_sf_1 bsmt_full_bath bsmt_half_bath full_bath half_bath
           <dbl>          <dbl>          <dbl>     <int>     <int>
 1             2              1              0         1         0
 2             6              0              0         1         0
 3             1              0              0         1         1
 4             1              1              0         2         1
 5             3              0              0         2         1
 6             3              0              0         2         1
 7             3              1              0         2         0
 8             1              0              0         2         0
 9             3              1              0         2         0
10             7              0              0         2         1
# ℹ 2,920 more rows
# ℹ 4 more variables: bedroom_abv_gr <int>, kitchen_abv_gr <int>,
#   fireplaces <int>, garage_cars <dbl>
Code
#To identify continuous variables negate the !all(x %in% 0:10)

ames_non_counts <- ames_numeric |> 
  select(where(function(x) !all(x %in% 0:10) ) )

Using the data subset of continuous variables, let's focus on columns that contain the word 'area'. Next, we identify variables containing the 'sf' (square feet) string.

Code
ames_non_counts |> 
  select(contains('area'))
# A tibble: 2,930 × 5
   lot_area mas_vnr_area gr_liv_area garage_area pool_area
      <int>        <dbl>       <int>       <dbl>     <int>
 1    31770          112        1656         528         0
 2    11622            0         896         730         0
 3    14267          108        1329         312         0
 4    11160            0        2110         522         0
 5    13830            0        1629         482         0
 6     9978           20        1604         470         0
 7     4920            0        1338         582         0
 8     5005            0        1280         506         0
 9     5389            0        1616         608         0
10     7500            0        1804         442         0
# ℹ 2,920 more rows
Code
ames_non_counts |> 
  select(contains('sf'))
# A tibble: 2,930 × 7
   bsmt_fin_sf_2 bsmt_unf_sf total_bsmt_sf first_flr_sf second_flr_sf
           <dbl>       <dbl>         <dbl>        <int>         <int>
 1             0         441          1080         1656             0
 2           144         270           882          896             0
 3             0         406          1329         1329             0
 4             0        1045          2110         2110             0
 5             0         137           928          928           701
 6             0         324           926          926           678
 7             0         722          1338         1338             0
 8             0        1017          1280         1280             0
 9             0         415          1595         1616             0
10             0         994           994         1028           776
# ℹ 2,920 more rows
# ℹ 2 more variables: wood_deck_sf <int>, open_porch_sf <int>

Eyeballing the data, it appears that the above-ground living area variable (gr_liv_area) is the sum of the first-floor and second-floor square footage. If this hypothesis is true, filtering for observations where gr_liv_area does not equal first_flr_sf + second_flr_sf should return no rows.

The hypothesis held for most rows, but 40 cases did not match.
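The filtering code behind the 40-row output below is not shown; a hedged sketch of the check, assuming gr_liv_area should equal first_flr_sf + second_flr_sf:

```r
# Sketch (assumption): keep only the rows where the living area is NOT
# the sum of the first- and second-floor square footage
ames_non_counts |>
  filter(gr_liv_area != first_flr_sf + second_flr_sf)
```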

# A tibble: 40 × 25
   lot_frontage lot_area year_built year_remod_add mas_vnr_area bsmt_fin_sf_2
          <dbl>    <int>      <int>          <int>        <dbl>         <dbl>
 1           50     5500       1929           2001            0           692
 2           50     9350       1947           1979            0             0
 3           60    10410       1916           1987            0             0
 4           60    10410       1915           1950            0             0
 5           60    12144       1949           1950            0             0
 6           60     8064       1948           1994            0             0
 7           98     8820       1890           1996            0             0
 8           60     5400       1920           1950            0             0
 9           92     5520       1912           1950            0             0
10           52     6240       1925           1950            0             0
# ℹ 30 more rows
# ℹ 19 more variables: bsmt_unf_sf <dbl>, total_bsmt_sf <dbl>,
#   first_flr_sf <int>, second_flr_sf <int>, gr_liv_area <int>,
#   tot_rms_abv_grd <int>, garage_area <dbl>, wood_deck_sf <int>,
#   open_porch_sf <int>, enclosed_porch <int>, three_season_porch <int>,
#   screen_porch <int>, pool_area <int>, misc_val <int>, mo_sold <int>,
#   year_sold <int>, sale_price <int>, longitude <dbl>, latitude <dbl>

Sale price and calculating averages

Focusing on sale price.

Code
ames |> 
  skim(sale_price)
Data summary
Name ames
Number of rows 2930
Number of columns 74
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
sale_price 0 1 180796.1 79886.69 12789 129500 160000 213500 755000 ▇▇▁▁▁
Code
#* Checking sale price distribution as well. Notice median < mean 
#* suggesting the distribution is positively skewed.

ames |> 
  select(sale_price) |> 
  gt_plt_summary()
select(ames, sale_price)
2930 rows x 1 cols
Column Plot Overview Missing Mean Median SD
sale_price 13K–755K 0.0% 180,796.1 160,000.0 79,886.7

Calculating average prices

Notice that Albert introduces a function to compute skewness; skewness() comes from the e1071 package. To display the results in a nice format, we can use pivot_longer().

We use across() to estimate the min, max, mean, median, standard deviation, and skewness.

To display the summary statistics nicely, we pivot the data set into long format.

In pivot_longer(), we select all of the columns with cols = everything(). Notice the median is lower than the mean, signaling that the distribution is positively skewed.

The end result is a summary statistics table.

Code
library(tidyverse)
library(gt)

#Calculating averages using across

ames |> 
  na.omit() |> 
  summarise(
    across(
      .cols = sale_price,
      .fns = list(
        min = min,
        avg = mean,
        median = median,
        max = max,
        standard_dev = sd,
        skewness = e1071::skewness),
      .names = '{.fn}'
    )
  ) |> 
#Pivoting longer to create a summary stats table
  pivot_longer(
    cols = everything()
  )
# A tibble: 6 × 2
  name             value
  <chr>            <dbl>
1 min           12789   
2 avg          180796.  
3 median       160000   
4 max          755000   
5 standard_dev  79887.  
6 skewness          1.74

Focusing on the neighborhood column

The neighborhood column has 28 levels, which makes it difficult to handle. Consequently, we can lump levels together. The function fct_lump_n() lumps all levels except for the n most frequent (or least frequent if n < 0). Here we keep the top 5 most frequent neighborhoods and lump the rest together, producing 6 levels; Other is a collection of the remaining neighborhoods.

Code
library(modeldata)
library(janitor)
library(skimr)


ames |> 
  count(neighborhood) |> 
  print(n = Inf)
# A tibble: 28 × 2
   neighborhood                                n
   <fct>                                   <int>
 1 North_Ames                                443
 2 College_Creek                             267
 3 Old_Town                                  239
 4 Edwards                                   194
 5 Somerset                                  182
 6 Northridge_Heights                        166
 7 Gilbert                                   165
 8 Sawyer                                    151
 9 Northwest_Ames                            131
10 Sawyer_West                               125
11 Mitchell                                  114
12 Brookside                                 108
13 Crawford                                  103
14 Iowa_DOT_and_Rail_Road                     93
15 Timberland                                 72
16 Northridge                                 71
17 Stone_Brook                                51
18 South_and_West_of_Iowa_State_University    48
19 Clear_Creek                                44
20 Meadow_Village                             37
21 Briardale                                  30
22 Bloomington_Heights                        28
23 Veenker                                    24
24 Northpark_Villa                            23
25 Blueste                                    10
26 Greens                                      8
27 Green_Hills                                 2
28 Landmark                                    1
Code
ames |>
  mutate(neighborhood = fct_lump_n(
    neighborhood,
    n = 5)
    ) |> 
  skim(neighborhood)
Data summary
Name mutate(ames, neighborhood…
Number of rows 2930
Number of columns 74
_______________________
Column type frequency:
factor 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
neighborhood 0 1 FALSE 6 Oth: 1605, Nor: 443, Col: 267, Old: 239
Code
ames_top_five <-  ames |>
  mutate(neighborhood = fct_lump_n(
    neighborhood,
    n = 5)
    ) 

ames_top_five |> 
  select(neighborhood) |> 
  count(neighborhood) |> 
  print(n = Inf)
# A tibble: 6 × 2
  neighborhood      n
  <fct>         <int>
1 North_Ames      443
2 College_Creek   267
3 Old_Town        239
4 Edwards         194
5 Somerset        182
6 Other          1605

Visualizing the data

With this new data set, we can create visualizations comparing house prices across those 5 neighborhoods (plus Other). We use facet_wrap() on neighborhood, and scale_x_log10() to put the x-axis on a logarithmic scale.

Code
ames_top_five |> 
  ggplot(aes(x = sale_price)) +
  geom_density(fill = 'grey80') +
  facet_wrap(~ neighborhood, nrow = 6) +
  scale_x_log10(labels = scales::label_dollar()) +
  theme_minimal(
    base_family = 'Source Sans Pro',
    base_size = 16
  )

Reframing the visualization with minimum and maximum sale price

We can make this figure more informative by adding labels for the minimum and maximum sale price. To do so, we add a geom_text() layer. The values for the layer can be extracted with the reframe() function.

The reframe() function lets us calculate the sale price range, returning a tibble with the range of prices. We add a type column to label the minimum and maximum via type = c("min", "max"), and compute the values by neighborhood using the .by = neighborhood argument.

We save the values in a data frame called ranges. In short, reframe() generates a long data set containing the min and max sale prices across neighborhoods.

Code
ames_top_five |> 
  reframe(
    type = c("min", "max"),
    sale_price = range(sale_price),
    .by = neighborhood
  )
# A tibble: 12 × 3
   neighborhood  type  sale_price
   <fct>         <chr>      <int>
 1 North_Ames    min        68000
 2 North_Ames    max       345000
 3 Other         min        13100
 4 Other         max       755000
 5 Somerset      min       139000
 6 Somerset      max       468000
 7 Old_Town      min        12789
 8 Old_Town      max       475000
 9 Edwards       min        35000
10 Edwards       max       415000
11 College_Creek min       110000
12 College_Creek max       475000
Code
ranges <- ames_top_five |> 
  reframe(
    type = c("min", "max"),
    sale_price = range(sale_price),
    .by = neighborhood
  )

Adding a text layer

Modifying the ggplot to add a geom_text() layer with labels for each of the neighborhoods.

In geom_text(), we specify the data and the aes(). For y we specify -1, for x we rely on sale_price from the ranges data frame, and for label we refer to the sale price variable. The y = -1 places the labels below the density curves.

Code
ames_top_five |> 
  ggplot(aes(x = sale_price)) +
  geom_density(fill = 'grey80') +
  geom_text(data = ranges, 
            aes(y = -1, label = sale_price)) +
  facet_wrap(vars(neighborhood), nrow = 6) +
  scale_x_log10(labels = scales::label_dollar()) +
  theme_minimal(
    base_family = 'Source Sans Pro',
    base_size = 16
  )

Improving the labels

Our first attempt produced raw, hard-to-read labels. We can make them nicer by passing sale_price to the scales::dollar() function.

Code
ames_top_five |> 
  ggplot(aes(x = sale_price)) +
  geom_density(fill = 'grey80') +
  geom_text(data = ranges, 
            aes(y = -1, label = sale_price |> 
                  scales::dollar()) ) + 
  facet_wrap(~ neighborhood, nrow = 6) +
  scale_x_log10(labels = scales::label_dollar()) +
  theme_minimal(
    base_family = 'Source Sans Pro',
    base_size = 16
  )

To make it more obvious what those labels stand for, we can add a geom_point() layer. It adds markers at the min and max values in the chart. We can also use an open diamond instead of a filled point (shape = 5).

Code
ames_top_five |> 
  ggplot(aes(x = sale_price)) +
  geom_density(fill = 'grey80') +
   geom_point(data = ranges, 
            aes(y = -0.3),
            shape = 5,
            size = 2) +
  geom_text(data = ranges, 
            aes(y = -1.2, label = sale_price |> 
                  scales::dollar()), size = 2.5 ) + 
  facet_wrap(~ neighborhood, nrow = 6) +
  scale_x_log10(labels = scales::label_dollar()) +
  theme_minimal(
    base_family = 'Source Sans Pro',
    base_size = 16
  )

Visualizing count variables

Another data visualization deals with count variables. Earlier we identified 9 count variables; let's check the column names. Several of them relate to the kitchen and the basement.

Code
ames_numeric |>
  select(where(function(x) all(x %in% 0:10) ) ) |>
  colnames()
[1] "bsmt_fin_sf_1"  "bsmt_full_bath" "bsmt_half_bath" "full_bath"     
[5] "half_bath"      "bedroom_abv_gr" "kitchen_abv_gr" "fireplaces"    
[9] "garage_cars"   

We can reduce the number of count variables. Keeping only variables whose values range from 0 to 5 reduces the 9 count variables to 7. Excluding the basement- and kitchen-related variables then leaves 4.

As shown by skim(), the values of these 4 variables range from 0 to 5.

Code
selected_counts <- ames_numeric |>
  select(where(function(x) all(x %in% 0:5) ) ) |>
  select(
    -contains('bsmt'),
    -contains('kitchen')
  )

selected_counts |>
  skim()
Data summary
Name selected_counts
Number of rows 2930
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
full_bath 0 1 1.57 0.55 0 1 2 2 4 ▁▇▇▁▁
half_bath 0 1 0.38 0.50 0 0 0 1 2 ▇▁▅▁▁
fireplaces 0 1 0.60 0.65 0 0 1 1 4 ▇▇▁▁▁
garage_cars 0 1 1.77 0.76 0 1 2 2 5 ▅▇▂▁▁

We can mutate again with across(), transforming the count variables into factors using .fns = factor. Notice that for most variables the frequencies concentrate at 0 and 1. Consequently, we can lump the higher levels together.

Code
selected_counts |>
  mutate(
    across(
      .cols = everything(),
      .fns = factor )
    ) |>
  skim()
Data summary
Name mutate(selected_counts, a…
Number of rows 2930
Number of columns 4
_______________________
Column type frequency:
factor 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
full_bath 0 1 FALSE 5 2: 1532, 1: 1318, 3: 64, 0: 12
half_bath 0 1 FALSE 3 0: 1843, 1: 1062, 2: 25
fireplaces 0 1 FALSE 5 0: 1422, 1: 1274, 2: 221, 3: 12
garage_cars 0 1 FALSE 6 2: 1603, 1: 778, 3: 374, 0: 158

Lumping factor variables

Lumping together factor levels with a limited range. Instead of fct_lump_n(), we can use if_else() to create the lumped values. In so doing, we use across() with an anonymous function that leaves values smaller than 2 as is, but replaces values of 2 or greater with the text '2+'. The function to do so is if_else().

Code
# selected_counts |>
#   mutate(
#     across(
#       .cols = everything(),
#       .fns = function(x) if_else(x < 2, x, '2+')
#   ) ) |>
#   skim()

Transforming numeric values into characters

Notice the function generated an error: mixing the text '2+' with integer values prompted R to report that it cannot combine integers and characters in the same if_else() call.

Consequently, we need to convert all values to character format. We can do so with as.character() inside the anonymous function. The as.character(x) branch handles values below 2, while values of 2 or greater are coded as the character '2+'.

The skim() function now reports three unique values per column: 0, 1, and '2+'. Note that for character columns, the min and max shown by skim() are the shortest and longest string lengths (1 and 2), not the value range, which is why '2+' does not appear there.
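A minimal illustration of this skim() behavior for character columns, assuming the tibble and skimr packages are available:

```r
library(tibble)
library(skimr)

# For character columns, skim() reports string lengths, not value ranges:
# min = 1 (for "0" and "1"), max = 2 (for "2+"), and 3 unique values
tibble(x = c("0", "1", "2+")) |>
  skim()
```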

Code
selected_counts |>
  mutate(
    across(
      .cols = everything(),
      .fns = function(x) if_else(x < 2,
                                 as.character(x), '2+')
  ) ) |>
  skim()
Data summary
Name mutate(…)
Number of rows 2930
Number of columns 4
_______________________
Column type frequency:
character 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
full_bath 0 1 1 2 0 3 0
half_bath 0 1 1 2 0 3 0
fireplaces 0 1 1 2 0 3 0
garage_cars 0 1 1 2 0 3 0

Next, we convert the 3 values back to factors by adding the factor() function at the end. Notice the levels = c(0:1, '2+').

Code
selected_counts |>
  mutate(
    across(
      .cols = everything(),
      .fns = function(x) if_else(x < 2,
                                 as.character(x),
                                 '2+'
                                 ) |>
        factor(levels = c(0:1,'2+')) ) ) |>
  skim()
Data summary
Name mutate(…)
Number of rows 2930
Number of columns 4
_______________________
Column type frequency:
factor 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
full_bath 0 1 FALSE 3 2+: 1600, 1: 1318, 0: 12
half_bath 0 1 FALSE 3 0: 1843, 1: 1062, 2+: 25
fireplaces 0 1 FALSE 3 0: 1422, 1: 1274, 2+: 234
garage_cars 0 1 FALSE 3 2+: 1994, 1: 778, 0: 158

We can modify the data set to include the house sale price, while keeping only count variables whose values stay within 0 to 5. We again exclude the basement and kitchen variables.

Code
selected_counts <- ames_numeric |>
  select(
    sale_price,
    where(function(x) all(x %in% 0:5) ) ) |>
  select(
    -contains('bsmt'),
    -contains('kitchen')
  )

selected_counts
# A tibble: 2,930 × 5
   sale_price full_bath half_bath fireplaces garage_cars
        <int>     <int>     <int>      <int>       <dbl>
 1     215000         1         0          2           2
 2     105000         1         0          0           1
 3     172000         1         1          0           1
 4     244000         2         1          2           2
 5     189900         2         1          1           2
 6     195500         2         1          1           2
 7     213500         2         0          0           2
 8     191500         2         0          0           2
 9     236500         2         0          1           2
10     189000         2         1          1           2
# ℹ 2,920 more rows

And then, we make sure our across() call includes all count variables while excluding the sale_price column: across(.cols = -sale_price). The end result is grouped_counts, which includes the count variables plus sale_price, a continuous variable.

Grouped counts using across(.cols = -sale_price)

Code
selected_counts |>
  mutate(
    across(
      .cols = -sale_price,
      .fns = function(x) if_else(x < 2, as.character(x),'2+') |>
        factor(levels = c(0:1,'2+')) ) ) |>
  skim()
Data summary
Name mutate(…)
Number of rows 2930
Number of columns 5
_______________________
Column type frequency:
factor 4
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
full_bath 0 1 FALSE 3 2+: 1600, 1: 1318, 0: 12
half_bath 0 1 FALSE 3 0: 1843, 1: 1062, 2+: 25
fireplaces 0 1 FALSE 3 0: 1422, 1: 1274, 2+: 234
garage_cars 0 1 FALSE 3 2+: 1994, 1: 778, 0: 158

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
sale_price 0 1 180796.1 79886.69 12789 129500 160000 213500 755000 ▇▇▁▁▁
Code
grouped_counts <- selected_counts |>
  mutate(
    across(
      .cols = -sale_price,
      .fns = function(x) if_else(x < 2,
                                 as.character(x),
                                 '2+'
                                 ) |>
        factor(levels = c(0:1,'2+')) ) )

grouped_counts
# A tibble: 2,930 × 5
   sale_price full_bath half_bath fireplaces garage_cars
        <int> <fct>     <fct>     <fct>      <fct>      
 1     215000 1         0         2+         2+         
 2     105000 1         0         0          1          
 3     172000 1         1         0          1          
 4     244000 2+        1         2+         2+         
 5     189900 2+        1         1          2+         
 6     195500 2+        1         1          2+         
 7     213500 2+        0         0          2+         
 8     191500 2+        0         0          2+         
 9     236500 2+        0         1          2+         
10     189000 2+        1         1          2+         
# ℹ 2,920 more rows

Displaying counts

The grouped_counts data set needs to be transformed into long format for ggplot(). We pivot everything except the sale price.

Code
grouped_counts |>
  pivot_longer(
    cols = -sale_price
  ) |>
  ggplot(aes(x = sale_price)) +
  geom_density(fill = 'grey80')

Faceting the count data set

In the pivot_longer() version of the grouped_counts data set, the name column contains the variable names (baths, fireplaces, garage cars), and the value column holds the number of units, a factor with levels 0, 1, and 2+.

Centering by amenities

We can center all distributions using the scale() function, scaling separately by type of amenity. Notice we get many warning messages: scaled values can be negative, and the log-10 transformation of a negative value is undefined. In other words, we should not apply scale_x_log10() here.

Code
grouped_counts |>
  pivot_longer(
    cols = -sale_price
  ) |>
  mutate(
    sale_price = scale(sale_price),
    .by = name
  ) |>
  ggplot(aes(x = sale_price)) +
  geom_density(fill = 'grey80') +
  scale_x_log10(label = scales::label_dollar()) +
  facet_grid(
    rows = vars(name),
    cols = vars(value)
    ) +
  theme_minimal(
    base_family = 'Source Sans Pro',
    base_size = 16
  )

Changing the log-10 transformation

Replacing the log-10 transformation by declaring the x-axis scale as continuous: scale_x_continuous()

Code
grouped_counts |>
  pivot_longer(
    cols = -sale_price
  ) |>
  mutate(
    sale_price = scale(sale_price),
    .by = name
  ) |>
  ggplot(aes(x = sale_price)) +
  geom_density(fill = 'grey80') +
  scale_x_continuous() +
  facet_grid(
    rows = vars(name),
    cols = vars(value)
    ) +
  theme_minimal(
    base_family = 'Source Sans Pro',
    base_size = 16
  )

Log-transforming the variable

However, the scaling works better if you log-transform the variable before visualizing. Note the code uses the natural log via log() rather than log10().

Code
grouped_counts |>
  pivot_longer(
    cols = -sale_price
  ) |>
  mutate(
    sale_price = scale(sale_price |> log()),
    .by = name
  ) |>
  ggplot(aes(x = sale_price)) +
  geom_density(fill = 'grey80') +
  scale_x_continuous() +
  facet_grid(
    rows = vars(name),
    cols = vars(value)
    ) +
  theme_minimal(
    base_family = 'Source Sans Pro',
    base_size = 16
  )